Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling

Abstract

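We propose novel deep speaker representation learning that considers perceptual similarity among speakers for multi-speaker generative modeling. Following its success in accurate discriminative modeling of speaker individuality, knowledge of deep speaker representation learning (i.e., speaker representation learning using deep neural networks) has been introduced to multi-speaker generative modeling. However, the conventional discriminative algorithm does not necessarily learn speaker embeddings suitable for such generative modeling, which may result in lower quality and less controllability of synthetic speech. We propose three representation learning algorithms that utilize a perceptual speaker similarity matrix obtained by large-scale perceptual scoring of speaker-pair similarity. The algorithms train a speaker encoder to learn speaker embeddings with three different representations of the matrix: a set of vectors, the Gram matrix, and a graph. Furthermore, we propose an active learning algorithm that iterates the perceptual scoring and speaker encoder training. To obtain accurate embeddings while reducing the costs of scoring and training, the algorithm selects unscored speaker-pairs to be scored next on the basis of the sequentially-trained speaker encoder's similarity prediction results. Experimental evaluation results show that 1) the proposed representation learning algorithms learn speaker embeddings strongly correlated with perceptual speaker-pair similarity, 2) the embeddings improve synthetic speech quality in speech autoencoding tasks better than d-vectors learned by the conventional discriminative algorithm, 3) the proposed active learning algorithm achieves higher synthetic speech quality while reducing the costs of scoring and training, and 4) among the proposed similarity {vector, matrix, graph} embedding algorithms, the first achieves the best speaker similarity of synthetic speech and the third gives the most improvement in synthetic speech naturalness.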
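The Gram-matrix representation and the active selection step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the squared-error loss, the assumption that perceptual scores are scaled to the range of embedding inner products, and the "highest predicted similarity first" ranking rule are all hypothetical choices made here for illustration. The abstract only states that the encoder is trained against a Gram-matrix view of the similarity matrix and that unscored pairs are chosen from the encoder's similarity predictions.

```python
from itertools import combinations


def gram_matrix(embeddings):
    """Inner products between every pair of speaker embeddings."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]


def gram_matching_loss(embeddings, similarity, scored_pairs):
    """Mean squared error between Gram entries and perceptual scores.

    Assumption: only pairs listed in `scored_pairs` contribute, and the
    scores are already scaled to be comparable with inner products.
    """
    gram = gram_matrix(embeddings)
    errors = [(gram[i][j] - similarity[i][j]) ** 2 for i, j in scored_pairs]
    return sum(errors) / len(errors)


def select_pairs_to_score(embeddings, scored_pairs, budget):
    """Pick unscored pairs with the highest predicted similarity.

    Assumption: this ranking rule is a placeholder; the source does not
    specify the selection criterion in the abstract.
    """
    gram = gram_matrix(embeddings)
    scored = {tuple(sorted(pair)) for pair in scored_pairs}
    candidates = [pair for pair in combinations(range(len(embeddings)), 2)
                  if pair not in scored]
    candidates.sort(key=lambda pair: gram[pair[0]][pair[1]], reverse=True)
    return candidates[:budget]


# Toy data: three speakers, 2-D embeddings, one perceptually scored pair.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
similarity = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]
scored_pairs = [(0, 1)]

loss = gram_matching_loss(embeddings, similarity, scored_pairs)
next_pairs = select_pairs_to_score(embeddings, scored_pairs, budget=1)
```

In an actual training loop, the loss would be minimized with respect to the encoder's parameters, and newly scored pairs would be appended to `scored_pairs` before the next round of training.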

Related resources

Attacking Speaker Recognition With Deep Generative Models

In this paper we investigate the ability of generative adversarial networks (GANs) to synthesize spoofing attacks on modern speaker recognition systems. We first show that samples generated with SampleRNN and WaveNet are unable to fool a CNN-based speaker recognition system. We propose a modification of the Wasserstein GAN objective function to make use of data that is real but not from the cla...

Deep Speaker Feature Learning for Text-Independent Speaker Verification

Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network stru...

Multi-Speaker Language Modeling

In conventional language modeling, the words from only one speaker are represented at a time, even for conversational tasks such as meetings and telephone calls. In a conversational or meeting setting, however, different speakers can influence each other. In order to recover this missing inter-speaker information, in this work we present a novel approach for conversational language modeling tha...

Ensemble speaker modeling using speaker adaptive training deep neural network for speaker adaptation

In this paper, we introduce an ensemble speaker modeling using a speaker adaptive training (SAT) deep neural network (SAT-DNN). We first train a speaker-independent DNN (SIDNN) acoustic model as a universal speaker model (USM). Based on the USM, a SAT-DNN is used to obtain a set of speaker-dependent models by assuming that all other layers except one speaker-dependent (SD) layer are shared amon...

Modeling intra-speaker variability for speaker recognition

In this paper we present a speaker recognition algorithm that explicitly models intra-speaker inter-session variability. Such variability may be caused by changing speaker characteristics (mood, fatigue, etc.), channel variability or noise variability. We define a session-space in which each session (either train or test session) is a vector. We then calculate a rotation of the session-space fo...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2021

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2021.3059114